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^_ A class of variable selection procedures for parametric models via 

^isi , nonconcave penalized likelihood was proposed by Fan and Li to simul- 

taneously estimate parameters and select important variables. They 
demonstrated that this class of procedures has an oracle property 
pH , when the number of parameters is finite. However, in most model 

^0 ' selection problems the number of parameters should be large and 

(-H , grow with the sample size. In this paper some asymptotic properties 

" ■ of the nonconcave penalized likelihood are established for situations 

in which the number of parameters tends to oo as the sample size 
increases. Under regularity conditions we have established an oracle 
property and the asymptotic normality of the penalized likelihood 
estimators. Furthermore, the consistency of the sandwich formula of 
the covariance matrix is demonstrated. Nonconcave penalized likeli- 
v^ ' hood ratio statistics are discussed, and their asymptotic distributions 

W^ , under the null hypothesis are obtained by imposing some mild condi- 

^+ ' tions on the penalty functions. The asymptotic results are augmented 

\^ ' by a simulation study, and the newly developed methodology is illus- 

("^ , trated by an analysis of a court case on the sexual discrimination of 

nJ ' salary. 

o 
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H ' 1.1. Background. The idea of penalization is very useful in statistical 

K»" ■ modeling, particularly in variable selection, which is fundamental to the 

field. Most traditional variable selection procedures, such as Akaike's infor- 
;_^ ■ mation criterion AIC [Akaike (1973)], Mallows' Cp [Mallows (1973)] and the 

Bayesian information criterion BIC [Schwarz (1978)], use a fixed penalty on 
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the size of a model. Some new variable selection procedures suggest the use 
of a data adaptive penalty to replace fixed penalties [i.e., Bai, Rao and Wu 
(1999) and Shen and Ye (2002)]. However, all these procedures follow step- 
wise and subset selection procedures to select variables. Stepwise and subset 
selection procedures make these procedures computationally intensive, hard 
to derive sampling properties, and unstable [see, e.g., Breiman (1996) and 
Fan and Li (2001)]. In contrast, most convex penalties, such as quadratic 
penalties, often produce shrinkage estimators of parameters that make trade- 
offs between bias and variance such as those in smoothing splines. However, 
they can create unnecessary biases when the true parameters are large and 
parsimonious models cannot be produced. 

To overcome the inefficiency of traditional variable selection procedures. 
Fan and Li (2001) proposed a unified approach via nonconcave penalized 
least squares to automatically and simultaneously select variables and es- 
timate the coefficients of variables. This method not only retains the good 
features of both subset selection and ridge regression, but also produces 
sparse solutions (many estimated coefficients are 0), ensures continuity of 
the selected models (for the stability of model selection) and has unbiased 
estimates for large coefficients. This is achieved by choosing suitable penal- 
ized nonconcave functions, such as the smoothly clipped absolute deviation 
(SCAD) penalty that was proposed by Fan (1997) (to be defined in Section 
2). Other penalized least squares, such as the bridge regression proposed by 
Frank and Friedman (1993) and Lasso proposed by Tibshirani (1996, 1997), 
can also be studied under this unified work. The nonconcave penalized least- 
squares approach also corresponds to a Bayesian model selection with an 
improper prior and can be easily extended to likelihood-based models in 
various statistical contexts, such as generalized linear modeling, nonpara- 
metric modeling and survival analysis. For example, Antoniadis and Fan 
(2001) used this approach in wavelet analysis, and Fan and Li (2002) ap- 
plied the technique to the Cox proportional hazards model and the frailty 
model. 

1.2. Nonconcave penalized likelihood. One distinguishing feature of the 
nonconcave penalized likelihood approach is that it can simultaneously select 
variables and estimate coefficients of variables. This enables us to establish 
the sampling properties of the nonconcave penalized likelihood estimates. 

Let log fiy, (3) be the underlying likelihood for a random vector V. This 
includes the likelihood of the form i{X /3, Y) of the generalized linear model 
[McCullagh and Nelder (1989)]. Let px{\(^j\) be a nonconcave penalized func- 
tion that is indexed by a regularization parameter A. The penalized likeli- 
hood estimator then maximizes 

n p 

(1-1) j2^ogf{v„(3)-Y.pxim). 
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The parameter A can be chosen by cross-vahdation [see Breiman (1996) and 
Tibshirani (1996)]. 

Various algorithms have been proposed to optimize such a high-dimensional 
nonconcave likehhood function. The modified Newton-Raphson algorithm 
was proposed by Fan and Li (2001). The idea of the graduated nonconvexity 
algorithm was proposed by Blake and Zisserman (1987) and Blake (1989), 
and was implemented by Nikolova, Idier and Mohammad- Dj atari (1998). 
Tibshirani (1996, 1997) and Fu (1998) proposed different algorithms for the 
Lp-penalty. One can also use a stochastic optimization method, such as sim- 
ulated annealing. See Geman and Geman (1984) and Gilks, Richardson and 
Spiegelhalter (1996) for more discussions. 

For the finite parameter case, Fan and Li (2001) established an "oracle 
property," to use the terminology of Donoho and Johnstone (1994). If there 
were an oracle assisting us in selecting variables, then we would select vari- 
ables only with nonzero coefficients and apply the MLE to this submodel 
and estimate the remaining coefficients as 0. This ideal estimator is called an 
oracle estimator. Fan and Li (2001) demonstrated that penalized likelihood 
estimators are asymptotically as efficient as this ideal oracle estimator for 
certain penalty functions, such as SCAD and the hard thresholding penalty. 
Fan and Li (2001) also proposed a sandwich formula for estimating the stan- 
dard error of the estimated nonzero coefficients and empirically verifying the 
consistency of the formula. Knight and Fu (2000) studied the asymptotic be- 
havior of the Lasso type of estimator. Under some appropriate conditions, 
they showed that the limiting distributions have positive probability mass at 
when the true value of the parameters is 0, and they established asymptotic 
normality for large parameters in some sense. 

1.3. Real issues in model selection. In practice, many variables are intro- 
duced to reduce possible modeling biases. The number of introduced vari- 
ables depends on the sample size, which reflects the estimability of the para- 
metric problem. 

An early reference on this kind of problem is the seminal paper of Neyman 
and Scott (1948). In the early years, from problems in X-ray crystallogra- 
phy, where the typical values for the number of parameters p and sample 
size n are in the ranges 10 to 500 and 100 to 10,000, respectively, Huber 
(1973) noted that in a variable selection context the number of parameters 
is often large and should be modeled as pn, which tends to oo. Now, with 
the advancement of technology and huge investment in various forms of data 
gathering, as Donoho (2000) demonstrated with web term-document data, 
gene expression data and consumer financial history data, large sample sizes 
with high dimensions are important characteristics. He also observed that 
even in a classical setting such as the Framingham heart study, the sample 
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size is as large as A^ = 25,000 and the dimension is p = 100, which can be 
modeled as p = 0{n^''^) or p = 0{n^'"^). 

Nonparametric regression is another class of examples that uses diverging 
parameters. In spline modeling an unknown function is frequently approxi- 
mated by its finite series expansion with the number of parameters depend- 
ing on the sample size. In regression splines, Stone, Hansen, Kooperberg 
and Truong (1997) regard nonparametric problems as large parametric prob- 
lems and extend traditional variable selection techniques to select important 
terms. Smoothing splines can also be regarded as a large parametric problem 
[Green and Silverman (1994)]. To achieve the stability of the resulting esti- 
mate (e.g., smoothness), instead of selecting variables a quadratic penalty 
is frequently used to shrink the estimated parameters [Cox and O'SuUivan 
(1990)]. Thus, our formulation and results have applications to the problem 
of nonparametric estimation. 

Fan and Li (2001) laid down important groundwork on variable selection 
problems, but their theoretical results are limited to the finite-parameter 
setting. While their results are encouraging, the fundamental problems with 
a growing number of parameters have not been addressed. In fact, the full 
advantages of the penalized likelihood method in model selection have not 
been convincingly demonstrated. For example, for finite-parameter prob- 
lems, owing to the root-n-consistency of estimated parameters, many naive 
and simple model selection procedures also possess the oracle property. To 
wit, a simple thresholding estimator such as i3jI{\Pj\ > n~^'^), which com- 
pletely ignores the correlation structure and the scale of the parameter, also 
possesses the oracle property. Thus, it is uncertain whether the oracle prop- 
erty of Fan and Li (2001) is genuine to the penalized likelihood method or 
an artifact of the finite-parameter formulation. 

To this end, we consider the log-likelihood series log/,i(V^,/3n), where 
/n(Ki,/3n) is the density of the random variable Vn, all of which relate to 
the sample size n, and assume without loss of generality that, unknown to 
us, the first Sn components of /?„, denoted by /?„!, do not vanish and the 
remaining p„ — s„ coefficients, denoted by f3n2, are 0. Our objectives in this 
paper are to investigate the following asymptotic properties of a nonconcave 
penalized likelihood estimator. 

1. (Oracle property.) Under certain conditions of the likelihood function 
and for certain penalty functions (e.g., SCAD), if Pn does not grow too 
fast, then by the proper choice of A„ there exists a penalized likelihood 
estimator such that /?„2 = and Pni behaves the same as the case in 
which /3„2 = is known in advance. 

2. (Asymptotic normality.) As the length of /3„i depends on n, we will con- 
sider its arbitrary linear combination AnPni, where An is a g x s„ matrix 



NONCONCAVE PENALIZED LIKELIHOOD 5 

for any finite q. We will show that this linear combination is asymptot- 
ically normal. Furthermore, let (5^i be the oracle estimator, thus maxi- 
mizing the likelihood of the ideal submodel J27=i^og fn{Vi, (3ni) ■ We will 
show that AnP^i is also asymptotically normal. We will study the con- 
ditions under which the two covariance matrices are identical. This will 
demonstrate the oracle property mentioned above. 

3. (Consistency of the sandwich formula.) Let S„ be an estimated covari- 
ance matrix for /3ni, using the sandwich formula based on the penalized 
likelihood (1.1). We will show that the covariance matrix S„ is a consis- 
tent estimate in the sense that j4^S„j4„ converges to the qxq asymptotic 
covariance matrix of AnPni ■ 

4. (Likelihood ratio theory.) If one tests the linear hypothesis Hq : Anfini = 
and uses the twice-penalized likelihood ratio statistic, then this statistic 
asymptotically follows a x^ distribution. 

The asymptotic properties of any finite components of /? are included 
in the above formulation by taking a special matrix An- Furthermore, the 
asymptotic properties and variable selection of linear components in any par- 
tial linear model can be analyzed this way if we use a series such as a Fourier 
series or polynomial splines to estimate the nonparametric component. 

1.4. Outline of the paper. In Section 2 we briefly review the noncon- 
cave penalized likelihood. The asymptotic results of penalized likelihood are 
presented in Section 3. We discuss the conditions that are imposed on the 
likelihood and penalty functions in Section 3.1 and present our main results 
in Sections 3.2-3.4. An application of the proposed methodology and a sim- 
ulation study are presented in Section 4. The proofs of our results are given 
in Section 5. Technical details are relegated to the Appendix. 

2. Penalty function. Penalty functions largely determine the sampling 
properties of the penalized likelihood estimators. To select a good penalty 
function. Fan and Li (2001) proposed three principles that a good penalty 
function should satisfy: unbiasedness, in which there is no overpenalization 
of large parameters to avoid unnecessary modeling biases; sparsity, as the 
resulting penalized likelihood estimators should follow a thresholding rule 
such that insignificant parameters are automatically set to to reduce model 
complexity; and continuity to avoid instability in model prediction, whereby 
the penalty function should be chosen such that its corresponding penalized 
likelihood produces continuous estimators of data. More details can be found 
in the work of Fan and Li (2001) and Antoniadis and Fan (2001). 

To gain some insight into the choice of penalty functions, let us first 
consider a simple form of (1.1), that is, the penalized least-squares problem: 

\{z-ef+p>.m. 



6 J. FAN AND H. PENG 

It is well known that the L2-penalty pA(|^|) = A|0p leads to a ridge regres- 
sion. A generalization is the Lg-penalty p^d^l) = '^l^l'^i q> I- These penalties 
reduce variability via shrinking the solutions, but do not have the properties 
of sparsity. 

The Li-penalty pa(|^|) = ^16*1 yields a soft thresholding rule 

^ = sgn(z)(|2;| - A)+. 

Tibshirani (1996, 1997) applied the Li-penalty to a general least-squares 
and likelihood setting. Knight and Fu (2000) studied the L^-penalty when 
q <1. While the L^-penalty {q < 1) functions result in sparse solutions, they 
cannot keep the resulting estimators unbiased for large parameters due to 
excessive penalty at large values of parameters. Another type of penalty 
function is the hard thresholding penalty function 

pxm) = x'-{\9\-xfi{\e\<x), 

which results in the hard thresholding rule [see Antoniadis (1997) and Fan 
(1997)] 

e = zi{\z\ >A), 

but the estimator is not continuous in the data z. 

As the penalty functions above cannot simultaneously satisfy the afore- 
mentioned three principles, motivated by wavelet analysis, Fan (1997) pro- 
posed a continuous differentiable penalty function called the smoothly clipped 
absolute deviation (SCAD) penalty, which is defined by 

p'^{e) = x\l{e < A) + 7^~7,^ J(g > A) 1 for some a > 2 and > 0. 
I (a — IjA J 

The solution for this penalty function is given by 

sgn(z)(|z| — A)+, when \z\ < 2A, 

{(a — l)z — sgn(2;)aA}/(a — 2), when 2A < \z\ < aX, 

z, when \z\ > aX. 

The solution satisfies the three properties that were proposed by Fan and 
Li (2001). 

3. Properties of penalized likelihood estimation. In this section we study 
the sampling properties of the penalized likelihood estimators proposed in 
Section 1 in the situation where the number of parameters tends to oo with 
increasing sample size. We discuss some conditions of the penalty and like- 
lihood functions in Section 3.1 and show their differences from those under 
finite parameters. Though the imposed conditions are not the weakest pos- 
sible, they make technical analysis easily understandable. Our main results 
are presented in Section 3.2. 
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3.1. Regularity conditions. 

3.1.1. Regularity condition on penalty. Let a^ = maxi<j<p^{p'^ (|/5nOj|)! 
PnOj / 0} and 6„ = m&xi<j<p^{p'^^{\PnOj\) , PnOj / 0}. Then we need to place 
the following conditions on the penalty functions: 

(A) hminf„„,+oohminfe^o+PA„(^)/'^" > 0; 

(B) a„ = 0(n-i/2). 

(B') a„ = o(l/ynp;;); 

(C) 6„ — > as n ^ +oo; 

(C) 6n = 0p(l/vp;:); 

(D) there are constants C and D such that, when Oi,02 > CXn, \px (^i) — 

pU(^2)\<D\6i-e2\. 

Condition (A) makes the penalty function singular at the origin so that 
the penalized likelihood estimators possess the sparsity property. Conditions 

(B) and (B') ensure the unbiasedness property for large parameters and the 
existence of the root-n-consistent penalized likelihood estimator. Conditions 

(C) and (C) guarantee that the penalty function does not have much more 
influence than the likelihood function on the penalized likelihood estimators. 
Condition (D) is a smoothness condition that is imposed on the nonconcave 
penalty functions. Under the condition (H) all of these conditions are satis- 
fied by the SCAD penalty and the hard thresholding penalty, as a^ = and 
bn = when n is large enough. 

3.1.2. Regularity conditions on likelihood functions. Due to the diverg- 
ing number of parameters, we cannot assume that likelihood functions are 
invariant in our study. Some conditions have to be strengthened to keep uni- 
form properties for the likelihood functions and sample series. A higher-order 
moment of the likelihood functions is a simple and direct method to keep 
uniform properties, as compared to the usual conditions in the asymptotic 
theory of the likelihood estimate under finite parameters [see, e.g., Lehmann 
(1983)]. The conditions that are imposed on the likelihood functions are as 
follows: 

(E) For every n the observations {Vni,i = 1,2, . . . ,n} are independent and 
identically distributed with the probability density /n(Kii )/?«), which 
has a common support, and the model is identifiable. Furthermore, 
the first and second derivatives of the likelihood function satisfy the 
equations 

f dlogfn{Vnl,(in) \ „ f ■ . o 

E,,^[ a^- 1=0 forj = l,2,...,p. 
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and 



E, 



Pu 



d\ogfn{Vnl,[in) d\og fn{Vnl, l^n) 

(F) The Fisher information matrix 

'dlogfn{Vnl,Pn) 



-E, 



IniPn) 



d(3n 



d''\ogfn{Vnu(in 
aiog/„(Kl,/3n) ^^^ 



satisfies conditions 

< Ci < \mm{In{(3n)} < Amax{/n(/?n)} < C2 < OO 

and, for j,A; = l,2,...,p„, 



for ah n 



^ > d\0gfn{Vnl,[in)d\0gfn{Vnl,(in) \^ ,^ ^ 

EpA J775 ^tt; \ <C3<oo 



9A 



nj 



dPnk 



and 



E, 



g'log/n(Kl,/3n )l' 



< C4 < 00. 



(G) There is a large enough open subset lj„ of 0^ G i?^" which contains 
the true parameter point /?„,, such that for almost all Vni the density 
admits all third derivatives dfniVni,Pn)/dPnjPnkPni for all /3„ G cOn- 
Furthermore, there are functions M^jki such that 

d\og fn{Vni,(3n 



d(3njPnkPnl 



<Mnjkl{Vni) 



for all (3n G ^n, and 



EpAM^MVni)}<C5<^ 



for all p, n and j, k, I. 
(H) Let the values of Pn0i,Pn02, ■ ■ ■ ,PnOs„ be nonzero and /3„o{s„+i),/3n02, 
• • • , PnOpn be zero. Then /3nOi, /3n02, • • • , /3nOs„ satisfy 



mill |/3„oil/A^ 



OO 



as n 



OO. 



Under conditions (F) and (G), the second and fourth moments of the 
likelihood function are imposed. The information matrix of the likelihood 
function is assumed to be positive definite, and its eigenvalues are uniformly 
bounded. These conditions are stronger than those of the usual asymptotic 
likelihood theory, but they facilitate the technical derivations. 

Condition (H) seems artificial, but it is necessary for obtaining the or- 
acle property. In a finite-parameter situation this condition is implicitly 
assumed, and is in fact stronger than that imposed here. Condition (H) 
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explicitly shows the rate at which the penalized likelihood can distinguish 
nonvanishing parameters from 0. Its zero component can be relaxed as 

max \PnOj\/^n^O asn— >oo. 

Sn + l<j<Pn 

3.2. Oracle properties. Recall that Vni,i = 1, . . . , n, are independent and 
identically distributed random variables with density fn{Vn,Pno)- Let 

n 

LniPn) =^^Ogfn(yni,(3n) 
i=\ 

be the log-likelihood function and let 

Vn 

QniPn) =Ln{(3n) - n'^P\„{\(3nj\) 

i=i 
be the penalized likelihood function. 

Theorem 1 (Existence of penalized likelihood estimator). Suppose that 
the density fn{Vn, f3no) satisfies conditions (E)-(G), and the penalty function 
Px„{-) satisfies conditions (B)-(D). If pf^/n ^ as n^oo, then there is a 
local maximizer J3n of Q{Pn) such that ||/3„ — /3„o|| = Op{^yp^{n~^''^ + an)}, 
where an is given in Section 3.1.1. 

It is easy to see that if a„, satisfies condition (B), that is, a„ = 0(n~^'^), 
then there is a root-(n/p„)-consistent estimator. This consistent rate is the 
same as the result of the M-estimator that was studied by Huber (1973), in 
which the number of parameters diverges. The convergence rate of a„ for the 
usual convex penalties, such as the Lg-penalty with q>l, largely depends 
on the convergence rate of A„. As these penalties do not have an unbiased- 
ness property, they require that A„ satisfy the condition A„ = 0{n~^''^) in 
order to have a root- (n/p„) -consistent estimator for the penalized likelihood 
estimator. This requirement will make it difficult to choose A„ for penal- 
ized likelihood in practice. However, if the penalty function is a SCAD or 
hard thresholding penalty, and condition (H) is satisfied by the model, it is 
clear that a^ = when n is large enough. The root-(n/p„)-consistent penal- 
ized likelihood estimator indeed exists with probability tending to 1, and no 
requirements are imposed on the convergence rate of Xn- 

Denote 

Sa„ = diagKj/3„oi), . . . ,PA„(/3nO.J} 
and 

bn = {PA„(|/3nOl|) Sgn(/3„oi), • • • ,P'xJ\PnOsJ)sgn{PnOsJ}'^ ■ 
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Theorem 2 (Oracle property). Under conditions (A)-(H), if A„ -^ 0, 
\/n/pnXn — > oo and p^/n ^0 as n —> oo, then, with probability tending to 
1, the root- (n/pn)- consistent local maximizer /3„ = ( a"^ ) in Theorem 1 must 
satisfy: 

(i) (Sparsity) /3n2 = 0. 
(ii) (Asymptotic normality) 

X [/3„1 - (3nOl + UniPnOl) + ^X,T^K] ^AA(0,G), 

where An is aqx Sn matrix such that AnA^ — > G, and G is aqxq nonegative 
symmetric matrix. 

By Theorem 2 the sparsity and the asymptotic normality are still valid 
when the number of parameters diverges. In fact, the oracle property holds 
for the SCAD and the hard thresholding penalty function. When n is large 
enough, Sa„ = and b„ = for the SCAD and the hard thresholding 
penalty. Hence, Theorem 2(ii) becomes 

y/^Anlll\[3nOl)0nl " /^nOl) ^ AA(0,G), 

which has the same efficiency as the maximum likelihood estimator of /3„,oi 
based on the submodel with /3„o2 = known in advance. This demonstrates 
that the nonconcave penalized likelihood estimator is as efficient as the oracle 
one. Intrinsically, unbiasedness and singularity at the origin of the SCAD and 
the hard thresholding penalty functions guarantee this sampling property. 

The -Lq-penalty, (? > 1, cannot simultaneously satisfy the conditions A„ = 
Op{n~^''^) and ^Jn/pXn — > oo as n — > oo. These penalty functions cannot 
produce estimators with the oracle property. The L^-penalty, ^ < 1, may 
satisfy these two conditions at same time. As shown by Knight and Fu (2000) 
in a finite-parameter setting, it might also have sampling properties that 
are similar to the oracle property when the number of parameters diverges. 
However, the bias term in Theorem 2(ii) cannot be ignored. 

The condition p^/n — > or p^/n ^ as n — > oo seems somewhat strong. 
By refining the structure of the log-likelihood function, such as the gener- 
alized linear model i{X (5,Y) or the M-estimator from J27=iPO^i ~ -^iP)j 
the condition can be weakened to Pn/n — > as n — > oo. This condition is in 
line with that of Huber (1973). 

3.3. Estimation of covariance matrix. As in Fan and Li (2001), by the 
sandwich formula let 

S„ = n{V^Ln0ni) - nJ^xAPni)}'' 

X c5v{VL„(/3„i)}{v2l„(/3„i) - nSA„(/3„i)}-i 
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be the estimated covariance matrix of /3„i , where 
cov{VL.„(/3„i)} = < - 2^ ■ 



n ^ d^j d(ik 



_ J 1 y^ dLnijPnl) 1 / 1 Y^ dLnijPnl) \ 

Denote by 

Sn = {IniPnOl) + ^X„{PnOl)} In{PnOl){IniPnOl) + ^XniPnOl)} 

the asymptotic variance of /9„i in Theorem 2(ii). 

Theorem 3 (Consistency of the sandwich formula). // conditions (A)- 
(H) are satisfied and p^/n — > as n ^ oo, then we have 

(3.1) AntnAl-AnJ^nAl^O OS n ^ oo 

for any q x Sn matrix An such that A^A^ = G, where q is any fixed integer. 

Theorem 3 not only proves a conjecture of Fan and Li (2001) about the 
consistency of the sandwich formula for the standard error matrix, but also 
extends the result to the situation with a growing number of parameters. 
The consistent result also offers a way to construct a confidence interval for 
the estimates of parameters. For a review of sandwich covariance matrix 
estimation, see the paper of Kauermann and Carroll (2001). 

3.4. Likelihood ratio test. One of the most celebrated methods in statis- 
tics is the likelihood ratio test. Can it also be applied to the penalized 
likelihood context with a diverging number of parameters? To answer this 
question, consider the problem of testing linear hypotheses: 

Ho : AnPnOi = vs. Hi : AnfinOl + 0, 

where A^ is a g x s„ matrix and AnA^^ = Iq for a fixed q. This problem 
includes testing simultaneously the significance of a few covariate variables. 
In the penalized likelihood context a natural likelihood ratio test for the 
problem is 

T„ = 2(supg(/?„|V) - sup Q(/3„|V)1. 
L n„ n„,A„/3„i=o J 

The following theorem drives the asymptotic null distribution of the test 
statistic. It shows that the classical likelihood theory continues to hold for 
the problem with a growing number of parameters in the penalized likelihood 
context. It enables one to apply the traditional likelihood ratio method for 
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testing linear hypotheses. In particular, it allows one to simultaneously test 
whether a few variables are statistically significant by taking some specific 
matrix An- 

Theorem 4. When conditions (A)-(H), (B' ) and (C' ) are satisfied, un- 
der Hq we have 

(3.2) Tn^xl 

provided that p^/n^Q as n^ oo. 

For the usual likelihood without penalization, Portnoy (1988) and Murphy 
(1993) showed that the Wilks type of result continues to hold for specific 
problems. Our results can be regarded as a further generalization of theirs. 

4. Numerical examples. In this section we illustrate the techniques of 
our method via an analysis of a data set in a lawsuit and verify the finite- 
sample performance via a simulation experiment. 

4.1. A real data example. The Fifth National Bank of Springfield faced 
a gender discrimination suit. [This example and the accompanying data set 
are based on a real case. Only the bank's name has been changed, according 
to Example 11.3 of Albright, Winston and Zappe (1999).] The charge was 
that its female employees received substantially smaller salaries than its male 
employees. The bank's employee database (based on 1995 data) is listed in 
Albright, Winston and Zappe (1999). For each of its 208 employees the data 
set includes the following variables: 

• EduLev: education level, a categorical variable with categories 1 (finished 
high school), 2 (finished some college courses), 3 (obtained a bachelor's 
degree), 4 (took some graduate courses), 5 (obtained a graduate degree). 

• JobGrade: a categorical variable indicating the current job level, the pos- 
sible levels being 1-6 (6 highest). 

• YrHired: year that an employee was hired. 

• YrBorn: year that an employee was born. 

• Gender: a categorical variable with values "Female" and "Male." 

• YrsPrior: number of years of work experience at another bank prior to 
working at the Fifth National Bank. 

• PCJob: a dummy variable with value 1 if the empolyee's current job is 
computer related and value otherwise. 

• Salary: current (1995) annual salary in thousands of dollars. 

A naive comparison of the average salaries of males and females will not 
work, since there are many confounding factors that affect salary. Since our 
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main interest is to provide, after adjusting contributions from confounding 
factors, a good estimate for the average salary difference between male and 
female employees, it is very reasonable to build a large statistical model to 
reduce possible modeling biases. In building such a model the estimability of 
parameters is a factor in choosing the number of parameters, which depends 
on the sample size. 

Two models arise naturally: the linear model 



Salary = /3o + /3i Female + /32PC Job + ^ /32+iEdui 
(4.1) 

+ J2 /36+i JobGrdi + /3i2YrsExp + /3i3 Age + £ 



i=l 

and the semiparametric model 

4 
Salary = /3o + /3i Female + /32PCJob + 2J /32+jEduj 

(4.2) 

+ 5]/36+iJobGrdi + /i(YrsExp) + ^(Age) + e, 

where the variable YrsExp is the years of working experience, computed 
from the variables YrHired and YrsPrior, and /i and /2 are two continuous 
functions to be parameterized. Figure 1 shows the distributions of the years 
of working experience and age. To take into account the estimability, the 
number of parameters used in modeling /i and /2 depends on the sample 
size. This falls within the framework of our study. In our analysis we employ 
quadratic spline models: 

fi{x) = anx + ai2X^ + aa{x - Xji)+ H h aij{x - Xi^)\, i = l,2, 

(4.3) 

where xn,. . . , Xjs are, respectively, the 2/7, 3/7, . . . , 6/7 sample quantiles of 
the variables YrsExp (i = 1) and Age (i = 2). In other words, the knots 
for YrsExp are 6, 8, 10, 12 and 15, and for Age are 32, 35, 37, 42 and 46. 
The total number of parameters for the classical linear model (4.1) is 14, 
and for the semiparametric model (4.2) is 26. Clearly, model (4.2) incurs 
smaller modeling biases than model (4.1). Since the number of parameters 
in both models is large, we apply the penalized likelihood method to select 
significant variables. Similarly to Fan and Li (2001), a modified GOV has 
been applied to choose the regularization parameter A. Find A to minimize 



n {1 — 7e(A)/n}^ 
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where /3(A) is the penahzed least-squares estimate for a given A and e(A) 
is the effective number of parameters defined in Section 4.2 of Fan and Li 
(2001). The values 7 = 1 and 7 = 2.5 are applied in model (4.1) and model 
(4.2), respectively. 

There are a few cases that do not appear typical. We deleted the samples 
with age over 60 or working experience over 30. They correspond mainly to 
the company executives who earned handsome salaries. [Two of them are 
females who were over age 60, employed at ages 54 and 58 with no prior 
experience and at the lowest grade, and with just a high school education. 
While these two female employees had relatively low salaries, they should 
not be regarded as being discriminated against. Further, they have high 
leverages for model (4.1), particularly in the direction of age. Deleting these 
two cases is reasonable, either from a statistical or a practical point of view.] 
As a result of this deletion, a sample of size 199 remains for our analysis. 

We first applied the ordinary least-squares fit. The estimated coefficients 
and their associated standard errors are summarized in Table 1. To apply 
the penalized likelihood method (1.1), we need to take care of the scale 
problem for each covariate. We normalized each covariate variable by the 
estimated standard error from the ordinary least squares and then estimated 
the coefficients and transformed the coefficients back to the original scale. 
This is equivalent to applying the penalized parameter Xj = ASE(/3j) for 
covariate Xj, where SE(/3j) is the standard error of Pj for the ordinary 
least squares estimate. The SCAD penalty function is used throughout our 
numerical implementation. 

The penalized least-squares estimates are also presented in Table 1. For 
the semiparametric model (4.2) the regression components for the variables 
YrsExp and Age are shown in Figure 2. The residual plots against the vari- 
ables YrsExp and Age are shown in Figure 3. They do not exhibit any 
systematic patterns. 
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Fig. 1. Distributions of years of working experience and age. 
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Fig. 2. Regression components /i and /2 for the semiparametric model (^4.2j. 

The three models ah have high multiple E? — over 80% of the salary vari- 
ation can be explained by the variables that we use. None of the results 
shows statistical evidence of discrimination. The coefficients in front of the 
indicator of Female are negative, but statistically insignificant. 

We now apply the likelihood ratio test to examine whether there is any 
discrimination against female employees. This leads to the following null 
hypothesis: 

Even in the presence of a large number of parameters, according to Theorem 
4, the likelihood ratio theory continues to apply. Table 2 summarizes the 
test results under both models (4.1) and (4.2), using both penalized and 



Table 1 
Estimates and standard errors for Fifth National Bank data 



Method 


Least squares 


SCAD PLS 


SCAD PLS 


Intercept 


54.238 


2.067) 


55.835 


(1.527) 


52.470 


(2.890) 


Female 


-0.556 


0.637) 


-0.624 


(0.639) 


-0.933 


(0.708) 


PcJob 


3.982 


0.908) 


4.151 


(0.909) 


2.851 


(0.640) 


Edl 


-1.739 


1.049) 





(-) 





(-) 


Ed2 


-2.866 


0.999) 


-1.074 


(0.522) 


-0.542 


(0.265) 


Ed3 


-2.145 


0.753) 


-0.914 


(0.421) 





(-) 


Ed4 


-1.484 


1.369) 





(-) 





(-) 


Jobl 


-22.954 


1.734) 


-24.643 


(1.535) 


-22.841 


(1.332) 


Job2 


-21.388 


1.686) 


-22.818 


(1.546) 


-20.591 


(1.370) 


Jobs 


-17.642 


1.634) 


-18.803 


(1.562) 


-16.719 


(1.391) 


Job4 


-13.046 


1.578) 


-13.859 


(1.529) 


-11.807 


(1.359) 


Job5 


-7.462 


1.551) 


-7.770 


(1.539) 


-5.235 


(1.150) 


YrsExp 


0.215 


0.065) 


0.193 


(0.046) 


(-) 


(-) 


Age 


0.030 


0.039) 





(-) 


(-) 


(-) 


R^ 


0.8221 




0.8176 




0.8123 
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Table 2 
SCAD penalized likelihood ratio test 





Likelihood ratio test 


Penalized likelihood ratio test 




X^-statistic 


P-value 


X'^-statistic 


P-value 


Model (1.1) 


0.7607 


0.3831 


1.0131 


0.3142 


Model (1.2) 


1.8329 


0.1758 


1.4311 


0.2316 



unpenalized versions of the likelihood ratio test. The results are consistent 
with the regression analyses depicted in Table 1. 

One can apply the logarithmic transformation to the salary variable and 
analyze the transformed data. This would make the transformed data more 
normally distributed. We opted for the original scale for the sake of inter- 
pretability. Furthermore, the conclusion does not change much. 

While there is no significant statistical evidence for discrimination based 
on the above analyses, the arguments can still go on. For example, as intu- 
itively expected, the job grade is a very important variable that determines 
the salary. For this data set, it explains 77.29% of the salary variation. Now 
the question arises naturally whether it was harder for females employees to 
be promoted, after adjusting for variables such as working experience, age 
and education level. We do not pursue this issue further. 



4.2. A simulation study. In this section we use a simulation study to 
augment our theoretical results. To present a situation in which the number 
of parameters depends on n, to show the applicability of our results is wider 
than what we have presented and to create dependence between covariates, 
we consider the following autoregressive model: 



-Pn ~T~ ^1 ^ — 1) -^J • ■ • ) ''^7 



(4.4) X, = /3lXi„l+/32X,_.2 + ••• + /?p^^- 

where /? = (11/4, -23/6,37/12, -13/9, 1/3,0, ... ,0)^ and e is white noise 
with variance a'^. The number of parameters depends naturally on n, as time 
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Fig. 3. Residuals after fitting the semiparametric model ('^■2). 
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series analysts naturally wish to explore the order of fit to reduce possible 
modeling biases. This time series model is stationary since its associated 
polynomial 

*(B)=(l--)(l-B + - 

has no zero inside the unit circle. 

In our simulation experiments 400 samples of sizes 100, 200, 400 and 
800 with pn = [4n^'^] — 5 are drawn from model (4.4). The penalized least- 
squares method with the SCAD penalty is employed. The algorithm of Fan 
and Li (2001) is used. The medians of relative model errors (MRMEs) 
(/3 — f3)'^EX'^X{$ — P), among the least-squares (LS) estimator, the pe- 
nalized least-squares (PLS) estimator and the oracle estimator, measure 
the effectiveness of the methods. The results are summarized in Table 3. 
As expected, the oracle estimator performs the best and the PLS performs 
comparably with the oracle estimator for all sample sizes. The LS estimator 
performs the worst and its relative performance deteriorates as n increases. 
The average number of zero coefficients is also reported in Table 3, in which 
the column labeled "Correct" presents the average number restricted only 
to the true zero coefficients, and the column labeled "Incorrect" depicts the 
average number of coefficients erroneously set to 0. For example, for n = 400, 
among seven nonzero coefficients, on average 5.78 coefficients, or 83%, were 
correctly estimated as 0, and among five nonzero coefficients, on average 0.22 
coefficient was incorrectly estimated as 0. The medians of the estimated co- 
efficients over 400 simulated data sets are presented in Table 4. The biases 
are quite small, except for estimated P^, which is difficult to estimate. The 
variances of the estimated coefficients across 400 simulations are presented 
in Table 5. 

To test the accuracy of the standard error formula, the standard devi- 
ations of the estimated coefficients are computed among 400 simulations. 
These can be regarded as the true standard errors (columns labeled SD in 

Table 3 
Simulation results for the time series model 

Average number of 
MRME (%) zero coeficients 



n 


Pn 


Oracle/LS 


PLS/LS 


Oracle/PLS 


Correct 


Incorrect 


100 


7 


75.33 


89.21 


80.17 


1.34 [67%] 


0.49 


200 


10 


50.61 


69.64 


73.27 


3.91 [78%] 


0.39 


400 


12 


40.03 


59.57 


73.06 


5.78 [83%] 


0.22 


800 


16 


31.75 


49.05 


70.08 


9.49 [86%] 


0.10 
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Table 4 
Median of estimators for coefficients of time series model 



n 


Pn 


01 


02 


^3 


04 


^5 


100 


7 


2.678 


-3.616 


2.739 


-1.096 





200 


10 


2.711 


-3.696 


2.856 


-1.240 


0.242 


400 


12 


2.729 


-3.769 


2.959 


-1.333 


0.293 


800 


16 


2.737 


-3.792 


3.023 


-1.383 


0.306 


True 


— 


2.750 


-3.833 


3.083 


-1.444 


0.333 



Table 5) and compared with 400 estimated standard errors. The 400 esti- 
mated standard errors are summarized by their median (columns SD^) and 
interquartile range divided by 1.349 (columns SDmad)) which is a robust es- 
timate of the standard deviation. The results are presented in Table 5. The 
accuracy gets better when n increases. Further, the accuracy is better for 
the first two coefficients than for the last two coefficients, which have lower 
signal-to-noise ratios. 

We now apply the penalized likelihood ratio test to the following null 
hypothesis: 

The likelihood ratio statistic is computed for each simulation. The distribu- 
tion of these statistics among 400 simulations can be regarded as the true 
null distribution and can be compared with the asymptotic distribution. Fig- 
ure 4 depicts the estimated densities of the likelihood ratio statistics among 
400 simulations for n = 100 and 400. 

5. Proofs of theorems. In this section, we give rigorous proofs of Theo- 
rems 1-4. 

Proof of Theorem 1. Let a„ = Y^(n"^/^ + a„) and set ||u|| = C, 
where C is a large enough constant. Our aim is to show that for any given 

Table 5 
Standard deviations (multiplied by 1000^ of estimators for time series model 







01 




02 




03 




04 




0B 






SD^ 




SD^ 




SD^ 




SD^ 




SD^ 


n 


SD 


(SD^^d) 


SD 


(SD^ad) 


SD 


(SD^ad) 


SD 


(SD^ad) 


SD 


(SD^ad) 


100 


120 


91 (5.1) 


337 


230 (29.8) 


525 


285 (66.6) 


451 


177 (87.2) 


249 


79 (66.7) 


200 


76 


66 (2.8) 


221 


174 (15.2) 


340 


231 (58) 


348 


170 (87.2) 


243 


64 (49.5) 


400 


50 


47 (1.2) 


149 


126 (4.5) 


222 


169 (8.8) 


204 


125 (9.0) 


129 


47 (3.90) 


800 


35 


34 (0.7) 


99 


90 (3.1) 


145 


121 (8.5) 


132 


90 (14.1) 


63 


34 (12.5) 
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Fig. 4. Estimated densities of the likelihood ratio statistics for n — 100 (dot-dash) and 
n = 400 (long-dash) along with the density of the xi distribution (solid). 

e there is a large constant C such that, for large n we have 

(5.1) P\ sup Qni/3nO + anU)<Qn{l3no)\>l-e. 

l||u||=C J 

This implies that with probabihty tending to 1 there is a local maximum /3„ 
in the bah {PnO + OnU: ||u|| < C} such that \\(3n — Pno\\ = Op{an)- 
Using j'A„(0) = 0, we have 

Dn{u) = QniPnO + "nU) - Qn(/?no) 

< LniPno + a„u) - L„(/3„o) 

~ n^{pX„i\PnOj + anUjl) - Px,X\PnOj\)} 

= (/)+(//). 

Then by Taylor's expansion we obtain 

(/) = anV^Ln{(3no)u+^u^V'^Ln{(3no)ual + iV^{u'^V2L„(/3*)u}ua3 

= /l + /2+/3, 

where the vector /3* lies between /3„,o and PnO + QnU, and 

{II) = -^[nanp'x^ilPnOjl) sgn{PnOj)uj + nalp'^{(3nOj)u]{l + o(l)}] 

= h + h- 
By condition (F), 

, , |/i| = |a,iV'^L„(/3„o)u| <a„||V'^L„(/3„o)|| ||u|| 

(5.2) 

= Op{any/np^)\\u\\ = Op(a^n)||u||. 
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Next we consider I2 . An application of Lemma 8 in the Appendix yields that 



(5.3) 



T 1 T 

^ 2 

1 



1 



n 



{V''Lnif3no) - EV^Ln{f3no)} 



u • na^ 



u^ IniPno)^ ■ nal 



-^u^/„(/3„o)u + Op(l) • na^lluf. 



By the Cauchy-Schwarz inequality and condition (G), we have 

1 ^ 8i„(/3;) 



T.T. 



nil d^ni dPnj dPnk 



n ( Pn 



UiUjUka^ 



1/2 



^7^J2\ Y. Ml^kiVni)} \\ufal 



6 



1=1 \i,j,k=l 



Since p^/n — > and p^a„ — > as n — > 00, we have 
n ( Pn ~i 1/2 



Z=l U,i,A:=l 



I 11'^ S 



Op{pf/'^an)nal\\u.\\'^ = Op{nal)\\u\ 



Thus, 
(5.4) 



/3 = o„(nQ„)||u 



PV^nl 



The terms I4 and I^ can be dealt with as follows. First, 

(5.5) I/4I <Y^\nanp'x^{\(3n0j\)&S^{f3n0j)uj\ < ^/s^ ■ nanan\\^\\ <nQ^||u| 



and 



(5.6) /5 = 5]naM„(l/3nO,l)u|{l + o(l)}<2. max pl^{\(i, 

j=l i-SjSSn 



■„oj|j-"-anl|u||^. 



By (5.2)-(5.6) and condition (C), and allowing ||u|| to be large enough, all 
terms /i, 13,14 and Is are dominated by I2, which is negative. This proves 

(5.1). n 



To prove Theorem 2, we first show that the nonconcave penalized estima- 
tor possesses the sparsity property /3„2 = by the following lemma. 
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Lemma 5. Assume that conditions (A) and (E)-(H) are satisfied. If 
Xn — > 0, \/n/pnXn -^ oo audp^/n — > as n^ oo, then with probability tend- 
ing to I, for any given /?„i satisfying \\j3ni - PnmW = Op{^/pJn) and any 
constant C , 

Q{(/3^i,0f} = ,, ,, max ^^^ Q{(/3^i,/3^2f }• 

IIA.2||<C(p„/n)i/2 



Proof. Let e„, = Cy/p^Jn. It is sufficient to sliow tliat witli probability 
tending to 1 as n — > oo, for any (5ni satisfying /3„i — /3„oi = Op{^Pn/n) we 
liave, for j = s„ + 1, . . . ,p„ 



-'m 



(5-7) ^^#^<0 forO</3„,<e„, 

(5.8) ^^M > for - .„ < /3„, < 0. 
By Taylor expansion, 

- VA„(l/3nj|)sgn(/?„j) 

= /l+/2 + /3 + /4, 

where /3* lies between /3„ and /?„o- Next, we consider Ji, /2 and Is. 
By a standard argument, we have 

(5.9) li = 0p(^) = 0p(^^^). 
The term I2 can be written as 

^f 9^L„(/?„o) a2L.(/?.o) ] 

, x^ (9^Ln(/3„o) 
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Using the Cauchy-Schwarz inequality and ||/3n — /?no|| = Op{\Jpn/n), we 
have 



\Ko 



Pn 



n 



^In{Pno)iJ,l)iPnl - PnOl) 



1=1 



Pn 



1/2 



<nOjJ^){j:iliM{j,l) 



d=l 



As the eigenvalues of the information matrix are bounded according to con- 
dition (F), we have 



X:i^(/3n0)(j,0=O(l)- 

1=1 
This entails that 

(5.10) K2 = Op(V^^). 

As for the term Ki, by the Cauchy-Schwarz inequality we have 



|i^l|<||/3n-/3n0| 



^f d^LniPno) _ ^d^LniPno) ' ^ 



11= 



^ I dPnj dPnl dPnj d^nl 



1/2 



Then from condition (F), it is easy to show that 



f^^ I dl3nj dj3nl dPnj dPnl 



1/2 



Opi^/nji^). 



By \\Pn — PnoW = Op{y/pn/n) it follows that Ki = Op{yJnpn)- This, together 
with (5.10), yields 

(5.11) h = Op{,/^^). 

Next we consider Lj,. We can write I^ as follows: 



E 



Lk= 



-Ettt^ — t: — f{(inj — PnOj)iPnk — PnOk) 



\ I dPnj PnlPnk d(3nj Pnl A 



nk 



"*" E ^ piQ .Q .2 . (^"J ~ PnOj){Pnk - PnOk) 



By condition (G), 

(5.12) \K^\<cl'^-npn-l 



/3„of = Op{pl) = Opiy/np:;;,). 



3n — /3no|| • 
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However, by the Cauchy-Schwarz inequality, 

'^ ~ ij^^y dPnj d(3nk d(3nl d(3nj dPnk d(3nl J 

Under conditions (G) and (H), we have 

(5.13) i^3 = Op|(np^,^) } = Op(V^^) 

From (5.9) and (5.11)-(5.13) we have 

h+h + h = Op{y/np^,). 



Because \Jpn/n/\n —>■ and hm inf „_,oo inf 5i_>o+ Pa i^)/^n > 0, from 



1 ^ Sgn{(3nj) + OpiJ — / Xn j 



dPnj 

it is easy to see that the sign of /3„j completely determines the sign of 
dQniPn)/d(3nj. Hence, (5.7) and (5.8) follow. D 

Proof of Theorem 2. As shown in Theorem 1, there is a root-(n/p„,)- 
consistent local maximizer /?„ of QniPn)- By Lemma 5, part (i) holds that 
Pn has the form {f5ni,0)'^- We need only prove part (ii), the asymptotic 
normality of the penalized nonconcave likelihood estimator Pni- 

If we can show that 

{IniPnOl) + ^xJiPnl - PnOl) + b„ = -VL„(/3„oi) + Opin'^/^), 

n 
then 

V^AnI-^'^{l3rm){In{Prm) + ^xJ[Kl " /^nOl + {/„(/3nOl) + ^xJ'^W] 



/n 
By the conditions of Theorem 2, we have the last term of Op(l). Let 



Y^i = -^AJ~^/\pnOl)VLm{PnOl), i = 1, 2, . . . ,n. 

It follows that, for any e, 

n 
i=l 

<n{i?||y„if}i/2{p(||y„i||>e)}V2. 



24 



J. FAN AND H. PENG 



By condition (F) and AnAf^ — > G, we obtain 



P{\\Yni\\>e)< 



E\\AnIn (/?nOl)VL„i(/3, 



'nOly 



ne 



0{n~^) 



and 



E\\Ynit = ^^P„/-i/2(/?„oi)VL„i(/3„oi)||^ 



< — Amax(^n.A„)Aniax{-fn.(/3nOl)}-E^||V Lnl(/3nOl)VLnl(/3nOl) || 



n^ 



O 



Thus, we have 

On the other hand, as AnA^ — > G, we have 

n 
^COv(y„i) = nCOv(y„i) = COv{AnI-^/^{PnOl)yLnl{PnOl)} ^ G. 
i=l 

Thus, Yni satisfies the conditions of the Lindeb erg-Feller central limit theo- 
rem [see van der Vaart (1998)]. This also means that I / ^/nAnlniPnOi)"^ '^ LniPnOi) 
has an asymptotic multivariate normal distribution. 

With a slight abuse of notation, let QniPni) = Qn(/3ni,0). As /3„i must 
satisfy the penalized likelihood equation VQniPni) = 0, using the Taylor 
expansion on VQn{(3ni) at point f3noi, we have 

^-[{V2l„(/3„oi) - '^^Px„if3*ni)}0ni " /3„oi) " VPa„(/5„oi)] 



n 



VL„(/3„oi) + ^(/3ni - /3noi)^V2{VL„(/3:i)}(/3„i - /3„,oi) 



where /5*]^ and /?** lie between /3„,i and /3„oi- Now we define 

£ = V2l„(/3„oi) - V2p,J/3**) 
and 

C = ^(/3nl - PnOlf^H^Ln{Pnl)}0nl " /^nOl)- 

Under conditions (G) and (H) and by the Cauchy-Schwarz inequality, we 



have 



(5.14) 



-C 



n 



2 -^ n 






n^ 



i=l 



Op(^)OpiPl) = oJl 
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At the same time, by Lemma 8 in the Appendix and because of condition 
(H), it is easy to show that 



A,: 



\ -£+IniPnOl) + ^\n \ =Op(^^], i = l,2,...,Sn, 



'nOl 



where Xi{A) is the ith eigenvalue of a symmetric matrix A. As Pni — A 
Op{^/pn/n), 

(5.15) |^£ + /„(/3„oi) + ^xAiPni - PnOl) = Op (-^) . 
FinaUy, from (5.14) and (5.15) we have 

(5.16) {IniPnOl) + ^xJiPnl " /3n,0l) + b„ = -V Ln{PnOl) + Op( ^ 

Following (5.16), Theorem 2 follows. D 

Proof of Theorem 3. Let An = -n~'^V'^ Ln0ni) + ^XniPni), ^n = 
cov{VL„(/3„i)}, A = IniPnOi) + ^\„ and B = IniPnOi)- Then we have 

S„ - S„ = An\l3n - 13)An^ + {An^ - A~^)BA~^ + A~^B{An^ - A^^) 

= h + l2 + h 
and 

Let Aj(^) be the ith eigenvalue of a symmetric matrix A. If we can show that 
\i{A — An) = Op(l) and \i{Bn — B) = Op(l), then from the fact that |Ai(S)| 
and |Ai(^)| are uniformly bounded away from and infinite, we have 

Ai(Sn -S„) =Op(l). 

This means that S„ is a weakly consistent estimator of S„. 
First, let us consider A — An and decompose it as follows: 

A-An = IniPnOl) + -V^Ln^nl) + Sa„(/3„0i) " ^X„0nl) = Ki + K2. 

n 
It is obvious that 

Xmm{Ki) + Amin(-?^2) < Amin(A'i + K2) 

< Amax(i^l + K2) < Ai^ax(^l) + Ke.AK2). 

Thus, we need only consider Xi{Ki) and Xi{K2) separately. The term Ki 
can be expressed as 

Kl = I„(/3„0l) + -V^Ln{f3nOl) - -V^Ln{f3nOl) + -V^Ln{(3nl). 

n n n 
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According to Lemma 8 in the Appendix, we have 



(5.17) 

As shown in Lemma 9 



IniPnOl) H V LniPnOl] 

n 



Op(l). 



(5.18) 



V^Lnifinl) V2L„(A 



'nOl) 



n 



n 



Or. 



n 



:Op(l). 



Thus, it follows from (5.17) and (5.18) that ||-ft'i|| =Op(l). This also means 
that we have 



(5.19) 



Xi{Ki) = Op{l), i = l,2,...,s„. 



As ll^ni - /3„oi|| = Op{^/pn/n), by condition (D), PxJPnj) - p'x^iPnOj) = 
Op(l), that is, 

(5.20) Xi{K2) = Op{l), i = 1,2,..., sn- 
Hence, from (5.19) and (5.20) we have shown that 

(5.21) X,{A-An)=Op{l), i= 1,2,..., Sn. 

Next we consider Xi{Bn — B). First we express Bn — B as the sum of 
i^3 and i^4, where 

-^3 = S - 2^ ^^5 ^;^ } - IniPnOlj 



and 



K. 



n fr{ dPj dPk 



1 y^ dLnijPnl) 1 / ^ Y^ dLnijPnl) 



Using the aforementioned argument, we need only consider Ky, and -ff4 sep- 
arately. 
Note that 

^E^^l|^-^A„(l/3n,l)=0, 3 = 1,2,. ..,Sn, 

which implies that 



(5.22) 



l^4f = EEK„(l/3n,l)}VAjl/3n.|)r 

i=lfc=l 
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By Taylor expansion, 

(5.23) v'xX^nA) =p'xS\l^n,,\) +v'L{\P*n,\)0n,-l3nO,), 

where /3* • lies between f3nj and /3njo- Prom (5.22) and (5.23) we obtain 



|2 I 



i^4f < 4 EPA.(l/3nO,|)' + CfPnl " /3„,o| 



(5.24) ^ ^ 



Finally, we consider K^. It is easy to see that K^ can be decomposed as 
the sum of K^ and Kq, where 

1 .r-^ dLni{Pnl) dLniiPnl) 1 v^ dLni{PnOl) dLni{PnOl) I 






^i^l ^f^J ^^k "ill ^(^J ^^k J 

As before, following Lemma 8 in the Appendix, it is easy to demonstrate 
that 

(5.25) \\K6\\=0p{l). 
In the Appendix we show that 

(5.26) ||i^5||=0p(l). 

By (5.24)-(5.26) we have shown that \\Bn — B\\ = Op(l) and 

(5.27) XiiBn-B) = Op{l), i = l,...,Sn. 

It follows from (5.21) and (5.27) that 

Aj(i;„ - S„) = Op(l), i = l,...,Sn- 

This completes the proof for the consistency of the sandwich formula. D 

Let Bn be an {sn — q) x Sn matrix which satisfies BnB^ = Is„~q and 
AnBj^ = 0. As f3ni is in the orthogonal complement to the linear space that 
is spanned by rows of An under the null hypothesis Hq, it follows that 

where 7 is an [sn — (?) x 1 vector. Then, under Hq the penalized likelihood 
estimator is also the local maximizer 7„ of the problem 

Qn(5^7„) = max(5„(5^7„). 

To prove Theorem 4 we need the following two lemmas, the proofs of 
which are given in the Appendix. 
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Lemma 6. Under the conditions of Theorem 4 and the null hypothesis 
Hq, we have 



Pnl-Pn01 = -In{PnOl) ^V Ln{PnOl) + Op{n ^/^), 



Bl{% - 7no) = -Bl{BrMfirm)Blr^BlVLn{^n^{) + Op(n-V2). 
n 

Lemma 7. Under the conditions of Theorem 4 and the null hypothesis 
Hq, we have 

Qn0nl) -Qn{Bnln) 

(5-28) 

= ^(/3„i - B^%fln{PnOl)0nl " B^%) + Op(l). 

Proof of Theorem 4. Let 9^ = IniPnoi) and $„ = ^VL„(/?„oi)- By 
Lemma 6 we have 

(5.29) = e-i/2{/„ - el/^B^{BnenB^r'Bnei/^}e-'/''^r^ 

It is easy to see that /„ — Qn Bj^{Bn@nB^)~'^BnQn is an idempotent 
matrix with rank q. Hence, by a standard argument and condition (F), 

Substituting (5.29) into (5.28), we obtain 



Qn{l3nl) - Qn{Bi%) 

n 
2 



By the property of the idempotent matrix, /„ — 0.„ B'^{Bn@nB^)~^BnQn 
can be written as the product form D^Dm where D„ is a g x s„ matrix that 
satisfies DnD^ = Iq. As in the proof of Theorem 2, we have shown that 

— 1/2 

y/nDnQn *&n bas an asymptotic multivariate normal distribution, that is. 
Finally, we have 

2{Qn0nl) - Qn{Bl%,)] 



n 



{Dr,Q-'/Hnf{Dn@n'^^^n) + Op{l) ^ Xq- □ 
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APPENDIX 

Lemma 8. Under the conditions of Theorem 1, we have 

(A.l) 
and 

(A.2) 



29 



n 



1 

Pn 



n^ 9/3, 0/3, J-^-^^-O' 



Pri 



Proof. For any e, by Chebyshev's inequality, 



P 



n 



y^LniM+IniPno) 



> 



Pr. 



< 



pi ^^(dLniPno) ^dKiUno) 



n^e"^ 



Pn 



n 



0(1). 



CfPniPnj CfPniP; 



nj 



Hence (A.l) follows. Similarly, we can prove (A.2). D 



Lemma 9. Under the conditions of Theorem 2, we have 



V^LniPnl) V2L„(A 



'nOD 



n 



n 



1 



/Pn 



Proof. First we expand the left-hand side of the equation above to the 
third order, 

V2l„(/3„i) V2l„(/3„,oi) 



n n 

-,2 



1 ^ \ dLn0ni) dL^idnm) 



Then by condition (G) and the Cauchy-Schwarz inequality. 



" jj=i U=i 



^ dLniPli) 



'Ati^didlijdh 



{(ink — PnOk) 
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^;^^5iS^^^^"^""^"°'"' 



Pn ( n 



2 



"' ^ ^^ ^ i,j,k 1 1=1 

Proof of (5.26). According to Taylor's expansion, we have 

dp, - d(5, +^ 5/3, ^^-' " ^"°) 
+ (/3ni - /3no)^v2^^:^i^(/3„i - /3„o) 

The matrix K^ can then be expressed as a sum of the foUowing form: 
^5 = - y] flij^ifc + - Xl a-ijCik + - Xl ^«jfc + ~ y] Cijaik 



D 



X1+X2 + X3 + X4 + X5 + X6+X7 + X8. 



Considering a matrix of the form n ""^(X^iLi ^iji/ik) — F, we have 



IFIP 






"- \i=ij=i I \i=ifc=i y 



Thus, the order of ||Xj|| can be determined from those of Ylii=\}ljL\0'iji 
Because of condition (F), for any i and j i^a.^ < C and 

we obtain 



^1 0(3, dp, ^ -^ for any n,j and A:, 



(A.3) J2T.^'j=OpinPn) 
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and 



(A.4) ^=iJ=i i=ij=ik=i^ opjOPk 



" ""r9^n^(/?n0l)l „^^^_^^^^|,2 



0,{npi)0,r-^]=0,{pi). 



n 
By condition (G) and using the Cauchy-Schwarz inequality, we show that 

^n ^n Sfi i' ^ 7" / Q-^ \ -^2 



f dLniiPnl) [ ||A ^ ||4 

^nl — PnOlll 



(A.5) 

From (A.3)-(A.5) we have 

lli^sf <8(||Xi||2 + ... + ||X8f) 

< 8 ;^ I Op(np„,p3 ) ^ Q^ Lp„ ^ j 

= Op(^)=«p(l)- 
This completes the proof. D 

Proof of Lemma 6. We need only prove the second equation. The 
first equation can be shown in the same manner. Following the steps of the 
proof of Theorem 2, it follows that under Hq, 

BnilniPnOi) + ^xjB^{% " 7no) " S„b„ = -B„VL„(/3„oi) + Op{n-^/^). 

n 

By the conditions a.„ = Op{l/^npn) and BnB^ = Is„-q, we have 

||^nb„|| < ||b„|| < VP^fln = Op(n~^/^). 
On the other hand, since 6„ = Op(l/y^^), we obtain 

ll^nSA„-B^(7n-7no)|| < ||7n - 7no||^n = Op ( ^= ) Op ( J— ) = Op( ^ 

\\/Pn/ \\ n / Kyn 

Hence, it follows that 

BnIniPnOl)B^{% - 7no) = -5nVL„(/3„oi) + Op{n-^/^). 
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As Xi{BnIn{PnOi)B^) IS Uniformly bounded away from and infinity, we 
have 



Bi {% - 7no) = Bi {BnIni/3nm)Bi}-'BnVL„{PnOi) + Op{n"'^'). 
This completes the proof. D 

Proof of Lemma 7. A Taylor's expansion of Qn0ni) - Qn{B'^%) at 
the point /3„i yields 

Qn0nl) - Qn{Bl%) = Ti + Ta + T3 + T^, 

where 

^3 = ^V^{(/3„i - S^7n)^v2L„(/5:i)(/5„i - Bl%)]0ni - B^%), 

Ti = i(/3nl - Bl^nVV^PxSklW + O{l)]0nl " i?^7n). 

Note that Ti = as V^Q(/3„i) = 0. By Lemma 6 and (5.29) it follows that 

By the conditions 6„ = Op[l/ ,Jp^) and q <pn, following the proof of I3 in 
Theorem 1, we have 



n = Opinpi/'n-'/V/')=Opil) 



and 



^4 < nbnWPni - -Bj7n|P =no. 



1 



0„ 



TnJ \n 



:Op(l). 



Thus, 

(A.6) QniPnl) - Qn{B^%) = T2 + Op{l). 

It follows from Lemmas 8 and 9 that 



n 



1 



/Pn 



Hence, we have 
1 



(/?„1 - B^%y {V'LniPnl) + nIn{PnOl)}{Pnl " B^%) 



T - 

n 



<Op[n——z] Opi-] =Op(l). 



The combination of (A.6) and (A.7) yields (5.28). D 
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